Lab 2 Overview
First Half
More Visualisations:
ggplot2 (ggmap)
HexBins vs SquareBins
Choropleths
Descriptive Statistics:
Histograms and looking at the distribution shape.
How to “quick plot” dataframes using ggplot2.
A brief summary:
Second Half
Revision:
- Any code related questions for R
Let’s get started
Load libraries
library(dplyr)
library(sf)
library(curl)
library(ggmap)
library(tmap)
library(tmaptools)
library(feather)Load data
## # A tibble: 6 x 19
## VendorID tpep_pickup_dateti… tpep_dropoff_datet… passenger_count trip_distance
## <int> <fct> <fct> <int> <dbl>
## 1 2 4/12/15 22:55 4/12/15 23:28 2 12.8
## 2 2 4/12/15 22:55 4/12/15 23:03 1 0.75
## 3 1 4/12/15 22:55 4/12/15 23:08 1 2.4
## 4 1 4/12/15 22:55 4/12/15 23:01 1 0.8
## 5 2 4/12/15 22:55 4/12/15 23:17 1 4.73
## 6 2 4/12/15 22:55 4/12/15 22:59 2 0.8
## # … with 14 more variables: pickup_longitude <dbl>, pickup_latitude <dbl>,
## # RatecodeID <int>, store_and_fwd_flag <fct>, dropoff_longitude <dbl>,
## # dropoff_latitude <dbl>, payment_type <int>, fare_amount <dbl>, extra <dbl>,
## # mta_tax <dbl>, tip_amount <dbl>, tolls_amount <dbl>,
## # improvement_surcharge <dbl>, total_amount <dbl>
ggplot2
It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).
library(ggplot2)
ggplot(df, aes(total_amount, trip_distance, colour = as.factor(payment_type))) +
geom_point()In spatial statistics the ability to visualize data and models superimposed with their basic social landmarks and geographic context is invaluable. ggmap is a new tool which enables visualization by combining the spatial information of static maps from Google Maps, OpenStreetMap, etc., with the layered grammar of graphics implementation of ggplot2.
Select the map.
xranges <- range(df$pickup_longitude[!df$pickup_longitude==0])
yranges <- range(df$pickup_latitude[!df$pickup_latitude==0])
xranges## [1] -77.04710 -71.06483
## [1] 37.27044 42.73614
ggmap(map_big) +
geom_point(data = df,
aes(x = dropoff_longitude,
y = dropoff_latitude),
colour="red", size = 0.05) +
geom_point(data = df,
aes(x = pickup_longitude,
y = pickup_latitude),
colour="blue", size = 0.05) Heatmaps
- Excellent tool when visualising geospatial coordinates.
Plot pickup locations.
Change the color scale.
ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude),
bins=50) +
scale_fill_viridis_c(trans = "log10")ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude),
bins=50) +
scale_fill_gradientn(colours = c("darkred", "orange", "yellow", "white"),trans="log10")ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude,
alpha = 0.95),
bins=50) +
scale_fill_gradientn(colours = c("darkred", "orange", "yellow", "white"),trans="log10")Filled by discrete factors.
ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude,
fill = as.factor(payment_type)),
bins=50) Change the color palette. A good palette makes your story better.
ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude,
fill = as.factor(payment_type)),
bins=100) +
scale_fill_brewer(palette = "Dark2")ggmap(map_big) +
stat_bin2d(data=df,
aes(x = pickup_longitude, y = pickup_latitude),
alpha = 0.9,
bins=100) +
scale_fill_viridis_c(trans = "log10") +
facet_grid(.~payment_type)Hexagon bins avoid the visual artefacts sometimes generated by the very regular alignment of geom_bin2d().
Plot HexBin.
ggmap(map_big) +
stat_binhex(data=df,
aes(x = pickup_longitude, y = pickup_latitude),
bins=100) +
coord_cartesian() +
scale_fill_gradient2(trans="log10")Shapefile Overlays
- NOTE: This only applies on the more recent datasets that use zones over coordinates
Requirements:
- ggplot2/ ggmap/ tmap/ leaflet
df_yellow = read.csv("../data/yellow_tripdata_2019-01.csv",stringsAsFactors = TRUE)
df_yellow %>%
group_by(PULocationID, payment_type) %>%
summarise(n = n()) -> df_yellow_summary## `summarise()` has grouped output by 'PULocationID'. You can override using the `.groups` argument.
Shapefile Links:
A shapefile is a simple, nontopological format for storing the geometric location and attribute information of geographic features. Geographic features in a shapefile can be represented by points, lines, or polygons (areas).
taxi_zones.zip taxi+_zone_lookup.csv
Basic information in shapefile.
## Reading layer `taxi_zones' from data source
## `/mnt/unimelb.edu.au/yyyou1/MAST30034_R/data/shapefile/taxi_zones'
## using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
Choropleth mapping with ggplot2
ggplot() +
geom_sf(data = sf, size = 1, color = "black", fill = "cyan1") +
ggtitle("Boundary Plot") +
coord_sf()+
theme_bw()Adding basemaps with ggmap
ggmap(map_big) +
geom_sf(data = sf, aes(fill=Shape_Area), inherit.aes = FALSE) +
coord_sf(crs = st_crs(4326)) +
scale_fill_viridis_c()Choropleth with tmap
tmap is specifically designed to make creation of thematic maps more convenient. It borrows from thw ggplot syntax and takes care of a lot of the styling and aesthetics. This reduces our amount of code significantly. We only need:
tm_shape() where we provide the sf object (we could also provide an SpatialPolygonsDataframe)
tm_polygons() where we set the attribute variable to map, the break style, and a title.
Incorporate information we have based on pickup locations to the simple feature collection.
df_yellow_summary[!duplicated(df_yellow_summary$PULocationID),]-> df_yellow_summary
sf_join <-
left_join(sf,df_yellow_summary,
by=c("LocationID"="PULocationID")) ## tmap mode set to interactive viewing
Read more about Reprojecting geographic data.
Descriptive Statistics
## VendorID tpep_pickup_datetime tpep_dropoff_datetime
## Min. :1.000 4/12/15 19:14: 508 4/12/15 21:00: 487
## 1st Qu.:1.000 4/12/15 22:44: 506 4/12/15 21:40: 486
## Median :2.000 4/12/15 21:57: 503 4/12/15 20:14: 484
## Mean :1.528 4/12/15 22:46: 491 4/12/15 20:58: 480
## 3rd Qu.:2.000 4/12/15 19:39: 489 4/12/15 21:01: 480
## Max. :2.000 4/12/15 19:28: 486 4/12/15 20:31: 476
## (Other) :97017 (Other) :97107
## passenger_count trip_distance pickup_longitude pickup_latitude
## Min. :0.000 Min. : 0.000 Min. :-77.05 Min. : 0.00
## 1st Qu.:1.000 1st Qu.: 1.030 1st Qu.:-73.99 1st Qu.:40.73
## Median :1.000 Median : 1.710 Median :-73.98 Median :40.75
## Mean :1.704 Mean : 2.827 Mean :-72.88 Mean :40.14
## 3rd Qu.:2.000 3rd Qu.: 3.100 3rd Qu.:-73.97 3rd Qu.:40.77
## Max. :6.000 Max. :91.200 Max. : 0.00 Max. :42.74
##
## RatecodeID store_and_fwd_flag dropoff_longitude dropoff_latitude
## Min. : 1.000 N:99421 Min. :-74.62 Min. : 0.00
## 1st Qu.: 1.000 Y: 579 1st Qu.:-73.99 1st Qu.:40.73
## Median : 1.000 Median :-73.98 Median :40.75
## Mean : 1.029 Mean :-72.93 Mean :40.17
## 3rd Qu.: 1.000 3rd Qu.:-73.96 3rd Qu.:40.77
## Max. :99.000 Max. : 0.00 Max. :41.49
##
## payment_type fare_amount extra mta_tax
## Min. :1.00 Min. :-120.00 Min. :-1.0000 Min. :-0.5000
## 1st Qu.:1.00 1st Qu.: 7.00 1st Qu.: 0.5000 1st Qu.: 0.5000
## Median :1.00 Median : 10.00 Median : 0.5000 Median : 0.5000
## Mean :1.34 Mean : 12.75 Mean : 0.6049 Mean : 0.4985
## 3rd Qu.:2.00 3rd Qu.: 15.00 3rd Qu.: 0.5000 3rd Qu.: 0.5000
## Max. :4.00 Max. : 500.00 Max. : 1.5000 Max. : 2.5000
##
## tip_amount tolls_amount improvement_surcharge total_amount
## Min. : 0.00 Min. : 0.0000 Min. :-0.3000 Min. :-120.30
## 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.3000 1st Qu.: 9.30
## Median : 1.46 Median : 0.0000 Median : 0.3000 Median : 12.80
## Mean : 1.80 Mean : 0.2386 Mean : 0.2997 Mean : 16.19
## 3rd Qu.: 2.55 3rd Qu.: 0.0000 3rd Qu.: 0.3000 3rd Qu.: 18.36
## Max. :115.00 Max. :24.0000 Max. : 0.3000 Max. : 550.30
##
Remember that not all the data should be interpreted as purely numerical. There may be conclusions you can draw by coincidence if you incorrectly assume data types! For example:
longitude and latitude should be interpreted as geospatial coordinates
payment_type is a discrete category of payment types
trip_distance is non-linear (not a straight line from A to B), but we have no further data on it. It is also in miles.
To avoid misinterpreting the attributes, refer to the data dictionary provided on the TLC website.
Scatterplot plots for Fare / Distance
General Inference:
- We can visually see that the relationship is relatively linear as you’d expect (more distance generally means more money)
- A fair number of outliers, notably around the 0 distance axis and 0 cost axis. Why might this be the case?
- Why are there negative values?
Histgram plots for Fare amount
- I guess we can kind of see most fares are between 0 - 100.
- Hard to tell where the main distribution is spread around.
- We can enhance this using a log transformation.
df$transform_fare_amount <- log(df$fare_amount)
ggplot(df)+
geom_histogram(aes(x=transform_fare_amount),bins = 30, fill="darkblue") +
theme_bw()- Take a log transformation to visually see the distribution
Do a correlation check.
## [1] 0.899978
library(pheatmap)
#remove characters
pheatmap(cor(df[,-c(1:3,9)]),
cluster_rows = FALSE,
cluster_cols = FALSE)- trip_distance highly correlates with high tips, tolls and overall trip amount
- payment_type seems to have some form of negative correlation with tip_amount. Must be careful as this is a discrete category.
cors =c("passenger_count", "trip_distance", "fare_amount", "extra",
"mta_tax", "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount")
cor(df[,cors])## passenger_count trip_distance fare_amount extra
## passenger_count 1.0000000000 0.001658362 0.0009505592 -0.003093358
## trip_distance 0.0016583622 1.000000000 0.8999780236 -0.223323624
## fare_amount 0.0009505592 0.899978024 1.0000000000 -0.207022734
## extra -0.0030933583 -0.223323624 -0.2070227338 1.000000000
## mta_tax 0.0131316387 -0.027817098 -0.1282543385 0.130305153
## tip_amount -0.0111149872 0.530623359 0.5573861481 -0.100025231
## tolls_amount 0.0065090920 0.572059235 0.5307124782 -0.157824284
## improvement_surcharge 0.0040463644 0.015943787 0.0489414626 0.095486378
## total_amount -0.0007104667 0.898539215 0.9824415224 -0.186521180
## mta_tax tip_amount tolls_amount
## passenger_count 0.01313164 -0.01111499 0.006509092
## trip_distance -0.02781710 0.53062336 0.572059235
## fare_amount -0.12825434 0.55738615 0.530712478
## extra 0.13030515 -0.10002523 -0.157824284
## mta_tax 1.00000000 -0.06226707 -0.094005581
## tip_amount -0.06226707 1.00000000 0.359383270
## tolls_amount -0.09400558 0.35938327 1.000000000
## improvement_surcharge 0.59152689 0.01722265 0.004483847
## total_amount -0.12168476 0.68568923 0.603590674
## improvement_surcharge total_amount
## passenger_count 0.004046364 -0.0007104667
## trip_distance 0.015943787 0.8985392152
## fare_amount 0.048941463 0.9824415224
## extra 0.095486378 -0.1865211796
## mta_tax 0.591526892 -0.1216847619
## tip_amount 0.017222652 0.6856892276
## tolls_amount 0.004483847 0.6035906745
## improvement_surcharge 1.000000000 0.0486015276
## total_amount 0.048601528 1.0000000000
Other plots to present correlations.
## corrplot 0.90 loaded
‘AOE’ is for the angular order of the eigenvectors.
Study Payment type and Distance
library(forcats)
df = df %>%
mutate(payment_type = as_factor(payment_type)) %>%
mutate(payment_type = recode(payment_type,`1` = "Credit Card", `2` = "Cash", `3` = "No charge", `4` = "Dispute"))## # A tibble: 4 x 2
## f n
## <fct> <int>
## 1 Credit Card 66601
## 2 Cash 32955
## 3 No charge 336
## 4 Dispute 108
#Collapse Cash/No charge/Dispute
df = df %>%
mutate(payment_type = fct_collapse(payment_type,
Credit_Card = c("Credit Card"),
Cash_Or_Others=c("Cash","No charge","Dispute")))ggplot(df, aes(payment_type)) +
geom_bar(fill=c("pink","orange")) +
ylab("Frequency") +
ggtitle("Barplot of Payment Methods")+
theme_bw()ggplot(df) +
geom_histogram(aes(x=trip_distance, fill=payment_type),
breaks=seq(0,20, by=1))+
theme_bw()+
scale_fill_brewer(palette = "Paired") Use facets.
ggplot(df) +
geom_histogram(aes(x=trip_distance, fill=payment_type),
breaks=seq(0,20, by=1))+
theme_bw()+
facet_grid(.~payment_type) +
scale_fill_brewer(palette = "Paired")The default setting of position to draw two histgrams together is by stacking them instead of their identity.
Plot the information of payment type and trip distance together in one.
ggplot(df) +
geom_boxplot(aes(x=payment_type,
y=trip_distance,
fill=payment_type))+
scale_y_continuous(trans='log2') +
scale_fill_brewer(palette = "Paired") +
theme_bw()A brief summary
Visualisations
- Good visualizations vs Bad visualizastions
Personal Checklist for Visualisations and Dashboards:
Your visualisation needs to tell a story.
It should be intepretable without being overly verbose.
The scale and axis need to make sense (and you can assume the reader knows the difference between a normal scale vs log scale).
The choice of visualisation needs to make sense:
- Line plot vs Bar chart with non-numerical categories
- Scatterplot vs Histogram plot to see distribution
- etc
- Choice of colour scheme / alpha / size need to be easy on the eyes.
At the end of the day, even if you think your visualisation is “pretty” or “beautiful”, if a reader cannot understand it, then it is not a good visualisation.
Use Spark in R
See sparklyr tutorial.
SessionInfo
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.1 corrplot_0.90 pheatmap_1.0.12 RColorBrewer_1.1-2
## [5] feather_0.3.5 tmaptools_3.1-1 tmap_3.3-2 ggmap_3.0.0
## [9] ggplot2_3.3.5 curl_4.3.2 sf_1.0-1 dplyr_1.0.7
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 httr_1.4.2 tools_4.0.3
## [4] bslib_0.2.5.1 utf8_1.2.1 R6_2.5.0
## [7] KernSmooth_2.23-17 DBI_1.1.1 colorspace_2.0-2
## [10] raster_3.4-13 withr_2.4.2 sp_1.4-5
## [13] tidyselect_1.1.1 leaflet_2.0.4.1 compiler_4.0.3
## [16] leafem_0.1.6 cli_3.0.0 labeling_0.4.2
## [19] bookdown_0.22 sass_0.4.0 scales_1.1.1
## [22] classInt_0.4-3 hexbin_1.28.2 proxy_0.4-26
## [25] stringr_1.4.0 digest_0.6.27 rmarkdown_2.9
## [28] base64enc_0.1-3 dichromat_2.0-0 jpeg_0.1-8.1
## [31] pkgconfig_2.0.3 htmltools_0.5.1.1 highr_0.9
## [34] htmlwidgets_1.5.3 rlang_0.4.11 rstudioapi_0.13
## [37] jquerylib_0.1.4 generics_0.1.0 farver_2.1.0
## [40] jsonlite_1.7.2 crosstalk_1.1.1 magrittr_2.0.1
## [43] s2_1.0.6 Rcpp_1.0.7 munsell_0.5.0
## [46] fansi_0.5.0 abind_1.4-5 lifecycle_1.0.0
## [49] stringi_1.7.2 leafsync_0.1.0 yaml_2.2.1
## [52] plyr_1.8.6 grid_4.0.3 parallel_4.0.3
## [55] crayon_1.4.1 lattice_0.20-41 stars_0.5-3
## [58] hms_1.1.0 knitr_1.33 pillar_1.6.1
## [61] rjson_0.2.20 codetools_0.2-16 wk_0.5.0
## [64] XML_3.99-0.6 glue_1.4.2 evaluate_0.14
## [67] leaflet.providers_1.9.0 png_0.1-7 vctrs_0.3.8
## [70] rmdformats_1.0.2 RgoogleMaps_1.4.5.3 gtable_0.3.0
## [73] purrr_0.3.4 tidyr_1.1.3 assertthat_0.2.1
## [76] xfun_0.24 lwgeom_0.2-6 e1071_1.7-7
## [79] class_7.3-17 viridisLite_0.4.0 tibble_3.1.2
## [82] units_0.7-2 ellipsis_0.3.2